Optimizing Reasoning Efficiency through Prompt Difficulty Prediction
Zhao, Bo; Kapusuzoglu, Berkcan; Balasubramaniam, Kartik; Sahu, Sambit; Chakraborty, Supriyo; Winata, Genta Indra
Reasoning language models perform well on complex tasks but are costly to deploy due to their size and long reasoning traces. We propose a routing approach that assigns each problem to the smallest model likely to solve it, reducing compute without sacrificing accuracy. Using intermediate representations from s1.1-32B, we train lightweight predictors of problem difficulty or model correctness to guide routing across a pool of reasoning models. On diverse math benchmarks, routing improves efficiency over random assignment and matches s1.1-32B's performance while using significantly less compute. Our results demonstrate that difficulty-aware routing is effective for cost-efficient deployment of reasoning models.
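A minimal sketch of the routing idea described above, assuming per-model correctness labels collected offline and a simple logistic probe over intermediate hidden states; the probe architecture, threshold, and all names are illustrative, not the authors' implementation.

```python
# Hedged sketch of difficulty-aware routing: train one lightweight correctness
# predictor per candidate model, then send each problem to the cheapest model
# predicted to solve it. Features and labels here are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1000, 256))        # intermediate representations (assumed)
labels = {                                   # 1 = model solved the problem offline
    "small":  rng.integers(0, 2, 1000),
    "medium": rng.integers(0, 2, 1000),
    "large":  rng.integers(0, 2, 1000),
}

probes = {name: LogisticRegression(max_iter=1000).fit(hidden, y)
          for name, y in labels.items()}

def route(x, threshold=0.5, order=("small", "medium", "large")):
    """Return the cheapest model predicted to solve the problem."""
    for name in order:
        if probes[name].predict_proba(x.reshape(1, -1))[0, 1] >= threshold:
            return name
    return order[-1]  # fall back to the largest model

print(route(hidden[0]))
```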
Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Gu, Jian; Aleti, Aldeida; Chen, Chunyang; Zhang, Hongyu
Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which can be located, traced, and analyzed. Despite advances in neural interpretability, it remains unclear how to transfer knowledge in a fine-grained manner, a problem known as parametric knowledge transfer (PKT). A key challenge is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for flexible and broadly applicable knowledge transfer. Because of neural incompatibility, i.e., the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify semantic alignment in latent space as the fundamental prerequisite for cross-scale knowledge transfer between LLMs. Instead of using layer parameters directly, our approach takes activations as the medium of layer-wise knowledge transfer. By leveraging the semantics of the latent space, our approach is simple, outperforms prior work, and better aligns model behaviors across scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors that ease cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
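A minimal sketch of the activation-as-medium idea, assuming paired activations from the two models on shared inputs and a learned linear alignment map; the actual alignment objective and injection mechanism are not specified in this abstract.

```python
# Hedged sketch: align latent activations across scales with a learned linear
# map rather than copying layer parameters. All sizes and data are illustrative.
import torch
import torch.nn as nn

d_src, d_tgt = 1024, 4096  # hidden sizes of a small and a large LLM (assumed)

# Paired activations for the same inputs, collected from both models offline.
src_acts = torch.randn(512, d_src)
tgt_acts = torch.randn(512, d_tgt)

align = nn.Linear(d_src, d_tgt)
opt = torch.optim.Adam(align.parameters(), lr=1e-3)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(align(src_acts), tgt_acts)
    loss.backward()
    opt.step()

# At transfer time, mapped activations stand in for the target layer's output,
# so knowledge moves through semantics in latent space, not raw parameters.
transferred = align(src_acts[:1])
print(transferred.shape)  # torch.Size([1, 4096])
```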
The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
Bafna, Niyati; Li, Tianjian; Murray, Kenton; Mortensen, David R.; Yarowsky, David; Sirin, Hale; Khashabi, Daniel
Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well understood. We first demonstrate the existence of an implicit task-solving → translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner and subsequently translates answer concepts into the intended target language. We hypothesize that failure of the translation stage, despite task-solving success, is an important culprit behind the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which each stage of the pipeline is responsible for final failure on a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of errors for the majority of language pairs and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant to future work seeking to improve multilinguality in LLMs.
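The attribution logic lends itself to a small worked example. The sketch below assumes one can label, per test case, whether the task was solved in a language-agnostic sense (e.g., via probing) and whether the final target-language output was correct; both the data and the stage-1 detection method are hypothetical.

```python
# Illustrative attribution of end-to-end failures to the two pipeline stages
# for a word translation task (toy data, not the paper's measurements).
cases = [
    # (task_solved_in_latent_space, correct_in_target_language)
    (True, True), (True, False), (True, False), (False, False),
]

failures = [c for c in cases if not c[1]]
translation_failures = sum(1 for solved, _ in failures if solved)

# Share of final errors explained by the translation barrier:
print(translation_failures / len(failures))  # 2 of the 3 failures here
```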
DeepCrossAttention: Supercharging Transformer Residual Connections
Heddes, Mike; Javanmard, Adel; Axiotis, Kyriakos; Fu, Gang; Bateni, MohammadHossein; Mirrokni, Vahab
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters. Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
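As a rough illustration of the core mechanism, the sketch below combines previous layer outputs with learnable, input-dependent weights; the gating design is an assumption, and the depth-wise cross-attention component is omitted.

```python
# Hedged sketch of input-dependent residual mixing in the spirit of DCA:
# a gate predicts per-position weights over earlier layer outputs, replacing
# the plain sum of a standard residual connection.
import torch
import torch.nn as nn

class InputDependentResidual(nn.Module):
    def __init__(self, d_model, n_prev):
        super().__init__()
        # Predict one mixing weight per previous layer from the current input.
        self.gate = nn.Linear(d_model, n_prev)

    def forward(self, x, prev_outputs):
        # prev_outputs: list of n_prev tensors, each (batch, seq, d_model)
        stacked = torch.stack(prev_outputs, dim=-2)      # (b, s, n_prev, d)
        w = torch.softmax(self.gate(x), dim=-1)          # (b, s, n_prev)
        mixed = (w.unsqueeze(-1) * stacked).sum(dim=-2)  # weighted combination
        return x + mixed  # residual update informed by all earlier layers

block = InputDependentResidual(d_model=64, n_prev=3)
x = torch.randn(2, 10, 64)
prev = [torch.randn(2, 10, 64) for _ in range(3)]
print(block(x, prev).shape)  # torch.Size([2, 10, 64])
```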
AdaptVC: High Quality Voice Conversion with Adaptive Learning
Kim, Jaehun; Kim, Ji-Hoon; Choi, Yeunju; Nguyen, Tan Dat; Mun, Seongkyu; Chung, Joon Son
The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches use various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised representations, and the decoder fuses them to produce speech that closely resembles the reference with minimal loss of content. Moreover, we use a conditional flow matching decoder with cross-attention speaker conditioning to further boost synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
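A hedged sketch of the adapter idea: a small trainable module that mixes frozen self-supervised layer features and projects them, shown here for a content branch; layer counts, dimensions, and the mixing scheme are assumptions, not the paper's architecture.

```python
# Illustrative adapter over frozen SSL features: learn a softmax-weighted mix
# of layer outputs plus a small projection, leaving the SSL backbone untouched.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, n_layers, d_ssl, d_out):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Sequential(nn.Linear(d_ssl, d_out), nn.GELU(),
                                  nn.Linear(d_out, d_out))

    def forward(self, layer_feats):
        # layer_feats: (n_layers, batch, frames, d_ssl) from a frozen SSL model
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w.view(-1, 1, 1, 1) * layer_feats).sum(dim=0)
        return self.proj(mixed)

content_adapter = FeatureAdapter(n_layers=12, d_ssl=768, d_out=256)
feats = torch.randn(12, 2, 100, 768)  # placeholder SSL features
print(content_adapter(feats).shape)   # torch.Size([2, 100, 256])
```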
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
Liu, Joseph; Geddes, Joshua; Guo, Ziyu; Jiang, Haomiao; Nandwana, Mahesh Kumar
Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves an 8% to 71% speedup while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
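The caching rule can be sketched as follows, assuming a calibration pass that records a layer's outputs across timesteps; the relative-error metric and threshold are illustrative stand-ins for the paper's calibration procedure.

```python
# Illustrative caching rule in the spirit of SmoothCache: reuse a layer's
# output at a timestep when calibration showed adjacent-step outputs are
# similar. Not the released implementation.
import torch

def calibrate(layer_outputs, threshold=0.05):
    """layer_outputs: (timesteps, ...) tensor from a small calibration run.
    Returns per-step flags: True means 'reuse the previous step's output'."""
    reuse = [False]
    for t in range(1, layer_outputs.shape[0]):
        rel_err = (layer_outputs[t] - layer_outputs[t - 1]).norm() \
                  / layer_outputs[t - 1].norm()
        reuse.append(rel_err.item() < threshold)
    return reuse

outs = torch.randn(8, 4, 16).cumsum(dim=0)  # fake smoothly varying activations
schedule = calibrate(outs)

cached = None
for t, flag in enumerate(schedule):
    if flag and cached is not None:
        y = cached       # skip the expensive module entirely
    else:
        y = outs[t]      # stand-in for running attention/FFN at step t
        cached = y

print(sum(schedule), "of", len(schedule), "steps reused a cached output")
```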
Proximal Mapping for Deep Regularization
Underpinning the success of deep learning are effective regularizations that allow a variety of priors in data to be modeled. However, most regularizers are specified in terms of hidden layer outputs, which are not themselves optimization variables. In contrast to prevalent methods that optimize them indirectly through the model weights, we propose inserting a proximal mapping as a new layer in the deep network, which directly and explicitly produces well-regularized hidden layer outputs. The resulting technique is shown to be closely connected to kernel warping and dropout, and novel algorithms are developed for robust temporal learning and multiview modeling, both outperforming state-of-the-art methods.
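As a concrete, simplified instance, the sketch below inserts a proximal mapping layer for an L1 prior, whose proximal operator is soft-thresholding; the paper's actual regularizers (e.g., those connected to kernel warping and dropout) would substitute different proximal operators.

```python
# Hedged sketch of a proximal mapping layer: prox_{lam*||.||_1}(z)
# = sign(z) * max(|z| - lam, 0), i.e., soft-thresholding, applied directly
# to hidden outputs so the regularization acts explicitly rather than only
# through the weights.
import torch
import torch.nn as nn

class ProxL1(nn.Module):
    def __init__(self, lam=0.1):
        super().__init__()
        self.lam = lam

    def forward(self, z):
        return torch.sign(z) * torch.clamp(z.abs() - self.lam, min=0.0)

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                    ProxL1(0.1),          # explicitly regularized hidden output
                    nn.Linear(64, 10))
print(net(torch.randn(4, 32)).shape)      # torch.Size([4, 10])
```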